The discontinuation of Hugging Face's Open LLM Leaderboard has left a gap in the community for standardized evaluation of large language models (LLMs). To address this, I developed the LLM Evaluation Framework, a comprehensive and modular tool designed to facilitate reproducible and extensible benchmarking of LLMs across various tasks and benchmarks.
The LLM Evaluation Framework can be found on my GitHub account: https://github.com/mattdepaolis/llm-evaluation
🧩 Why This Framework Matters
The Open LLM Leaderboard was instrumental in providing a centralized platform for evaluating and comparing LLMs. Its retirement has underscored the need for tools that allow researchers and developers to conduct their own evaluations with transparency and consistency. The LLM Evaluation Framework aims to fill this void by offering:

- Modular Design: Inspired by microservice architecture, enabling easy integration and customization.
- Multiple Model Backends: Support for Hugging Face (hf) and vLLM backends, allowing flexibility in model loading and inference.
- Quantization Support: Evaluate quantized models (e.g., 4-bit and 8-bit with hf, AWQ with vLLM) to assess performance under resource constraints.
- Comprehensive Benchmarks: Includes support for standard benchmarks like MMLU, GSM8K, BBH, and more.
- Leaderboard Replication: Easily run evaluations mimicking the Open LLM Leaderboard setup with standardized few-shot settings.
- Flexible Configuration: Customize evaluations via CLI arguments or programmatic usage.
- Detailed Reporting: Generates JSON results and Markdown reports for easy analysis.
- Parallelism: Leverages vLLM for efficient inference, including tensor parallelism across multiple GPUs.
🚀 Getting Started
Installation

1. Clone the repository:

```bash
git clone https://github.com/mattdepaolis/llm-evaluation.git
cd llm-evaluation
```
2. Set up a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`
```

3. Install dependencies:

```bash
pip install -e lm-evaluation-harness
pip install torch numpy tqdm transformers accelerate bitsandbytes sentencepiece
pip install vllm  # Only needed if you plan to use the vLLM backend
```

🧪 Example Usage
Using the Command-Line Interface (CLI)
Evaluate a model on the HellaSwag benchmark:
```bash
python llm_eval_cli.py \
    --model hf \
    --model_name google/gemma-2b \
    --tasks hellaswag \
    --num_fewshot 0 \
    --device cuda  # Use 'cpu' if you don't have a GPU
```

This command will download the gemma-2b model (if not already cached), run it on the HellaSwag benchmark with 0 few-shot examples, and save the results in the results/ and reports/ directories.
Using as a Python Library
Integrate the evaluation logic directly into your Python scripts:
```python
from llm_eval import evaluate_model
import os

# Define evaluation parameters
eval_config = {
    "model_type": "hf",
    "model_name": "google/gemma-2b-it",
    "tasks": ["mmlu", "gsm8k"],
    "num_fewshot": 0,
    "device": "cuda",
    "quantize": True,
    "quantization_method": "4bit",
    "batch_size": "auto",
    "output_dir": "./custom_results",  # Optional: specify output location
}

# Run the evaluation
try:
    results_summary, results_file_path = evaluate_model(**eval_config)
    print("Evaluation completed successfully!")
    print(f"Results summary: {results_summary}")
    print(f"Detailed JSON results saved to: {results_file_path}")

    # Construct the expected report path
    base_name = os.path.splitext(os.path.basename(results_file_path))[0]
    report_dir = os.path.dirname(results_file_path).replace("results", "reports")
    report_file_path = os.path.join(report_dir, f"{base_name}_report.md")
    if os.path.exists(report_file_path):
        print(f"Markdown report saved to: {report_file_path}")
    else:
        print("Markdown report not found at expected location.")
except Exception as e:
    print(f"An error occurred during evaluation: {e}")
```

📊 Reporting and Results
The framework generates:

- JSON Results: Detailed results for each task, including individual sample predictions (if applicable), metrics, and configuration details, saved in the results/ directory.
- Markdown Reports: A summary report aggregating scores across tasks, generated in the reports/ directory.
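As an illustration, a downstream script can load one of these JSON result files and flatten the per-task metrics into readable lines. The schema sketched below is an assumption for demonstration purposes; the real files may nest additional configuration fields.

```python
import json

# Hypothetical results structure: task names mapped to metric dictionaries.
# In practice you would load a real file from the results/ directory, e.g.:
#   with open("results/my_run.json") as f:
#       sample_results = json.load(f)
sample_results = {
    "results": {
        "hellaswag": {"acc": 0.71, "acc_norm": 0.78},
        "gsm8k": {"exact_match": 0.42},
    }
}

def summarize(results: dict) -> list:
    """Flatten per-task metrics into 'task/metric: value' lines."""
    lines = []
    for task, metrics in results["results"].items():
        for metric, value in metrics.items():
            lines.append(f"{task}/{metric}: {value:.3f}")
    return lines

for line in summarize(sample_results):
    print(line)
```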
📄 How the Evaluation Report Looks
When you run an evaluation using the LLM Evaluation Framework, it generates comprehensive yet easy-to-understand reports in both Markdown and JSON formats. Here’s a broad overview of what you can expect from the Markdown report:
1. 📊 Summary of Metrics
This section provides a clear table summarizing the evaluation results for each individual task. Each row includes:

- Task: The specific benchmark or task evaluated (e.g., leaderboard_bbh_boolean_expressions).
- Metric: The evaluation metric used (e.g., accuracy, exact match).
- Value: The model's performance score on that task.
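A table of this shape is straightforward to produce from (task, metric, value) triples; the snippet below is an illustrative sketch, not the framework's actual reporting code:

```python
def metrics_table(rows):
    """Render (task, metric, value) triples as a Markdown summary table."""
    lines = ["| Task | Metric | Value |", "| --- | --- | --- |"]
    for task, metric, value in rows:
        lines.append(f"| {task} | {metric} | {value:.4f} |")
    return "\n".join(lines)

rows = [
    ("leaderboard_bbh_boolean_expressions", "acc_norm", 0.8120),
    ("gsm8k", "exact_match", 0.4655),
]
print(metrics_table(rows))
```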
2. 📈 Normalized Scores
This section provides normalized scores, giving you an easy-to-interpret percentage representation of the model's performance relative to the benchmark standards. It includes:

- Benchmark: The benchmark's name.
- Score: The normalized percentage score.
This helps quickly identify the relative strengths and weaknesses of the evaluated model.
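As a concrete sketch of what such normalization can look like, one common convention (used by the Open LLM Leaderboard) rescales the raw score so that random guessing maps to 0 and a perfect score maps to 100. The framework's exact formula may differ; this is only an illustration:

```python
def normalize_score(raw: float, random_baseline: float) -> float:
    """Rescale raw accuracy so the random-guess baseline maps to 0
    and a perfect score maps to 100 (clamped at 0 from below)."""
    if not 0.0 <= random_baseline < 1.0:
        raise ValueError("random_baseline must be in [0, 1)")
    return max(0.0, (raw - random_baseline) / (1.0 - random_baseline)) * 100.0

# A 4-way multiple-choice task has a 25% random baseline:
print(normalize_score(0.625, 0.25))  # 50.0
```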
3. 🔍 Task Samples (Detailed Examples)
The report also offers detailed, human-readable examples from evaluated tasks, allowing you to qualitatively assess the model's outputs:

- Question: Clearly presents the evaluation sample question.
- Ground Truth: The correct answer or expected response.
- Model Response: The exact response provided by your evaluated model, clearly labeled as correct or incorrect.
This section is especially valuable for error analysis and understanding how your model handles specific types of queries.
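A minimal sketch of rendering one such sample (the framework's real formatting logic lives in its reporting module and may differ; the exact-match check here is a simplifying assumption):

```python
def format_sample(question: str, ground_truth: str, model_response: str) -> str:
    """Render one evaluation sample in the Question / Ground Truth /
    Model Response layout, flagging the response as correct or incorrect."""
    verdict = "correct" if model_response.strip() == ground_truth.strip() else "incorrect"
    return (
        f"**Question:** {question}\n"
        f"**Ground Truth:** {ground_truth}\n"
        f"**Model Response:** {model_response} ({verdict})"
    )

print(format_sample("What is 2 + 2?", "4", "4"))
```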
⚙️ Customization
These reports can be customized or extended further by modifying the reporting logic, enabling deeper analyses or alternative formats as needed.
🔧 Extending the Framework
The modular design makes it easier to add new functionalities:
- Adding New Tasks/Benchmarks:
- Define the task configuration in llm_eval/tasks/task_registry.py or a similar configuration file.
- Ensure the task is compatible with the lm-evaluation-harness structure or adapt it.
- Supporting New Model Backends:
- Create a new model handler class in llm_eval/models/ inheriting from a base model class (if applicable).
- Implement the required methods for loading, inference, etc.
- Register the new backend type. 
- Customizing Reporting:
- Modify the report generation logic in llm_eval/reporting/ to change the format or content of the Markdown/JSON outputs.
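To make the backend-extension step concrete, here is a hypothetical sketch of what a new model handler could look like. The base-class name, module layout, and method signatures are assumptions for illustration; check the actual classes in llm_eval/models/ before implementing:

```python
from abc import ABC, abstractmethod

class ModelBackend(ABC):
    """Hypothetical base class for model backends (the real one may differ)."""

    @abstractmethod
    def load(self, model_name: str) -> None:
        """Load model weights and tokenizer."""

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 128) -> str:
        """Run inference on a single prompt."""

class EchoBackend(ModelBackend):
    """Trivial stand-in backend that echoes the prompt back, useful for
    testing the evaluation pipeline without loading a real model."""

    def load(self, model_name: str) -> None:
        self.model_name = model_name

    def generate(self, prompt: str, max_tokens: int = 128) -> str:
        return prompt[:max_tokens]

backend = EchoBackend()
backend.load("dummy-model")
print(backend.generate("Hello"))  # Hello
```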
🤝 Contributing
Contributions are welcome! Please follow standard practices:
- Fork the repository.
- Create a new branch for your feature or bug fix (`git checkout -b feature/my-new-feature`).
- Make your changes and commit them (`git commit -am 'Add some feature'`).
- Push to the branch (`git push origin feature/my-new-feature`).
- Create a new Pull Request.